Introduction

This document demonstrates a complete single-cell RNA-seq analysis workflow using the PBMC dataset with Seurat V5. The analysis includes:

  • Data loading and preprocessing
  • Quality control and filtering
  • Normalization and scaling
  • Feature selection
  • Principal Component Analysis (PCA)
  • Clustering
  • UMAP visualization
  • Differential expression analysis
  • Cell type annotation (by canonical predetermined marker genes)

Load Required Libraries

library(dplyr)
library(Seurat) # v5.3
library(patchwork)

1. Load the PBMC Dataset

We load the 10X Genomics data using the Read10X function.

# Function: Read10X()
# Input: 
#   - genes.tsv (gene names, character vector)
#   - barcodes.tsv (cell barcodes, character vector)
#   - matrix.mtx (expression counts, sparse matrix format)
# Output: 
#   - pbmc.data: dgCMatrix (sparse matrix)
#   - Dimensions: 32738 genes (rows) × 2700 cells (columns)
#   - Data type: Integer counts (non-negative)

pbmc.data <- Read10X(data.dir = "/Users/outobi/Documents/Learn bioinformatics /scRNA by Seurat V5/filtered_gene_bc_matrices/hg19/")

Initialize the Seurat Object

We create a Seurat object with basic filtering: - min.cells = 3: Keep genes detected in at least 3 cells - min.features = 200: Keep cells with at least 200 detected genes

# Function: CreateSeuratObject()
# Input: 
#   - counts: sparse matrix (32738 genes × 2700 cells, integer counts)
#   - min.cells: numeric (3)
#   - min.features: numeric (200)
# Output: 
#   - pbmc: Seurat object
#   - Filtered to 13714 features × 2700 cells
#   - meta.data: data.frame with 3 columns (orig.ident, nCount_RNA, nFeature_RNA)
#     - orig.ident: character (project name)
#     - nCount_RNA: numeric (total counts per cell)
#     - nFeature_RNA: numeric (number of detected genes per cell)

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)
##  1 layer present: counts

The resulting object contains 13,714 features across 2,700 samples within 1 assay.

Seurat Object Structure (Step 1)

pbmc
├── assays
│   └── RNA
│       └── layers
│           └── counts (13714 genes × 2700 cells)
├── meta.data (cell-level metadata: orig.ident, nCount_RNA, nFeature_RNA)
└── project: pbmc3k

Examine Expression of Specific Genes

Let’s look at the expression of some marker genes (CD3D, TCL1A, MS4A1) across the first 30 cells:

pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
## 3 x 30 sparse Matrix of class "dgCMatrix"
##                                                                    
## CD3D  4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2  3 . . . . . 3 4 1 5
## TCL1A . .  . . . . . . 1 . . . . . . . . . . .  . 1 . . . . . . . .
## MS4A1 . 6  . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .

2. Quality Control

Calculate QC Metrics

The [[ operator can add columns to object metadata directly! This is a great place to stash QC stats. We calculate the percentage of mitochondrial gene expression, which is an indicator of cell quality. High mitochondrial content often indicates dying cells.

# Function: PercentageFeatureSet()
# Input: 
#   - pbmc: Seurat object (13714 genes × 2700 cells)
#   - pattern: character ("^MT-", regex for mitochondrial genes)
# Output: 
#   - percent.mt: numeric vector (length 2700, one value per cell)
#   - Data type: numeric (percentage, 0-100)
#   - Added to pbmc@meta.data as new column

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

# View assays and metadata
pbmc@assays
## $RNA
## Assay (v5) data with 13714 features for 2700 cells
## First 10 features:
##  AL627309.1, AP006222.2, RP11-206L10.2, RP11-206L10.9, LINC00115, NOC2L,
## KLHL17, PLEKHN1, RP11-54O7.17, HES4 
## Layers:
##  counts
head(pbmc@meta.data)
##                  orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
## AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
## AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
## AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
## AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898
## AAACGCACTGGTAC-1     pbmc3k       2163          781  1.6643551

Visualize QC Metrics

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

Feature-Feature Relationships

Examine relationships between QC metrics:

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

Filter Cells

Based on the QC metrics, we filter cells to retain high-quality cells: - Between 200 and 2500 detected genes - Less than 5% mitochondrial content

# Function: subset()
# Input: 
#   - pbmc: Seurat object (13714 genes × 2700 cells)
#   - subset: logical expression (filtering criteria)
# Output: 
#   - pbmc: Seurat object (filtered)
#   - Dimensions: 13714 genes × 2638 cells (62 cells removed)
#   - Retains only high-quality cells

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc
## An object of class Seurat 
## 13714 features across 2638 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)
##  1 layer present: counts

Seurat Object Structure (Step 2)

pbmc
├── assays
│   └── RNA
│       └── layers
│           └── counts (filtered cells)
├── meta.data (cell-level metadata: orig.ident, nCount_RNA, nFeature_RNA, percent.mt)
└── project: pbmc3k

3. Normalization

We normalize the data using log-normalization with a scale factor of 10,000. This accounts for differences in sequencing depth between cells.

# Function: NormalizeData()
# Input: 
#   - pbmc@assays$RNA@layers$counts: sparse matrix (13714 genes × 2638 cells, integer counts)
#   - normalization.method: character ("LogNormalize")
#   - scale.factor: numeric (10000)
# Output: 
#   - pbmc@assays$RNA@layers$data: sparse matrix (13714 genes × 2638 cells)
#   - Data type: numeric (log-normalized values)
#   - Formula: log1p(counts / total_counts * 10000)
#   - Values typically range from 0 to ~10

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

Seurat Object Structure (Step 3)

pbmc
├── assays
│   └── RNA
│       └── layers
│           ├── counts (raw counts)
│           └── data (normalized data)
├── meta.data
└── project: pbmc3k

4. Find Variable Features

We identify highly variable genes that show high cell-to-cell variation. These genes are informative for downstream analysis.

# Function: FindVariableFeatures()
# Input:
#   - pbmc@assays$RNA@layers$data
#       Sparse matrix (13714 genes × 2638 cells) of log-normalized counts
#   - selection.method = "vst"
#       Variance-stabilizing transformation used to model mean–variance relationship
#   - nfeatures = 2000
#       Number of most variable genes to retain
#
# Output:
#   - pbmc@assays$RNA@meta.data  (feature-level metadata; 13714 rows × 8 columns)
#       * vf_vst_counts_mean
#           Mean expression of each gene across all cells (on vst input scale)
#       * vf_vst_counts_variance
#           Observed variance of expression for each gene
#       * vf_vst_counts_variance.expected
#           Expected variance at that mean, from the fitted mean–variance curve
#       * vf_vst_counts_variance.standardized
#           Standardized variance = (observed / expected); measures “extra” variability
#       * vf_vst_counts_variable
#           Logical flag; TRUE if the gene is selected as a highly variable feature
#       * vf_vst_counts_rank
#           Rank of each gene by standardized variance (1 = most variable)
#       * var.features
#           Gene name if it is a variable feature, otherwise NA (redundant helper column)
#       * var.features.rank
#           Rank among variable features (usually matches vf_vst_counts_rank for HVGs)
#
#   - Variable features:
#       VariableFeatures(pbmc) returns the top nfeatures (e.g., 2000) genes
#       where vf_vst_counts_variable == TRUE.

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# top 10 highly variable gene
VariableFeatures(pbmc)[1:10]
##  [1] "PPBP"   "LYZ"    "S100A9" "IGLL5"  "GNLY"   "FTL"    "PF4"    "FTH1"  
##  [9] "GNG11"  "S100A8"

Seurat Object Structure (Step 4)

pbmc
├── assays
│   └── RNA
│       ├── layers
│       │   ├── counts
│       │   └── data
│       └── meta.data (vst.mean, vst.variance, vst.variance.expected, vst.variance.standardized, )
└── meta.data

Visualize Variable Features

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# Plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2

5. Scale Data

We scale the data to give equal weight to each gene in downstream analyses. This centers the expression of each gene to mean 0 and scales to unit variance.

# Function: ScaleData()
# Input: 
#   - pbmc@assays$RNA@layers$data: sparse matrix (13714 genes × 2638 cells, log-normalized)
#   - features: character vector (13714 gene names, all genes)
# Output: 
#   - pbmc@assays$RNA@layers$scale.data: dense matrix (13714 genes × 2638 cells)
#   - Data type: numeric (scaled values)
#   - For each gene: mean = 0, standard deviation = 1
#   - Values typically range from -3 to +3 (z-scores)
#   - Note: Converts from sparse to dense matrix

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

Seurat Object Structure (Step 5)

pbmc
├── assays
│   └── RNA
│       ├── layers
│       │   ├── counts      # raw counts (sparse)
│       │   └── data        # log-normalized expression
│       └── meta.data       # feature-level metadata (13714 genes × 8 columns)
│           ├── vf_vst_counts_mean                 # mean expression
│           ├── vf_vst_counts_variance             # observed variance
│           ├── vf_vst_counts_variance.expected    # expected variance from mean–variance fit
│           ├── vf_vst_counts_variance.standardized# standardized variance (extra variability)
│           ├── vf_vst_counts_variable             # TRUE/FALSE: selected as HVG
│           ├── vf_vst_counts_rank                 # rank by variability (1 = most variable)
│           ├── var.features                       # gene name if HVG, else NA
│           └── var.features.rank                  # rank among variable features
└── meta.data   

Examine Data Structure in Seurat V5

# View the different data layers
cat("Counts layer (first 5x5):\n")
## Counts layer (first 5x5):
pbmc@assays$RNA@layers$counts[1:5, 1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##               
## [1,] . . . . .
## [2,] . . . . .
## [3,] . . . . .
## [4,] . . . . .
## [5,] . . . . .
cat("\nNormalized data layer (first 5x5):\n")
## 
## Normalized data layer (first 5x5):
pbmc@assays$RNA@layers$data[1:5, 1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##               
## [1,] . . . . .
## [2,] . . . . .
## [3,] . . . . .
## [4,] . . . . .
## [5,] . . . . .
cat("\nScaled data layer (first 5x5):\n")
## 
## Scaled data layer (first 5x5):
pbmc@assays$RNA@layers$scale.data[1:5, 1:5]
##             [,1]        [,2]        [,3]        [,4]        [,5]
## [1,] -0.05812316 -0.05812316 -0.05812316 -0.05812316 -0.05812316
## [2,] -0.03357571 -0.03357571 -0.03357571 -0.03357571 -0.03357571
## [3,] -0.04166819 -0.04166819 -0.04166819 -0.04166819 -0.04166819
## [4,] -0.03364562 -0.03364562 -0.03364562 -0.03364562 -0.03364562
## [5,] -0.08223981 -0.08223981 -0.08223981 -0.08223981 -0.08223981

6. Principal Component Analysis (PCA)

We perform PCA on the scaled data using only the variable features (only in selected most variable features after scale). This reduces dimensionality while retaining the most important sources of variation.

# Function: RunPCA()
# Input: 
#   - pbmc@assays$RNA@layers$scale.data: matrix (13714 genes × 2638 cells, scaled)
#   - features: character vector (2000 variable genes)
# Output: 
#   - pbmc@reductions$pca@cell.embeddings: matrix (2638 cells × 50 PCs)
#     - Data type: numeric (PCA scores, cell coordinates in PC space)
#     - Each row is a cell, each column is a principal component
#   - pbmc@reductions$pca@feature.loadings: matrix (2000 genes × 50 PCs)
#     - Data type: numeric (gene loadings/weights)
#     - Each row is a gene, each column is a principal component
#   - pbmc@reductions$pca@stdev: numeric vector (50 values, standard deviation of each PC)

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

Seurat Object Structure (Step 6)

pbmc
├── assays
│   └── RNA
│       └── layers (counts, data, scale.data)
├── meta.data
└── reductions
    └── pca
        ├── cell.embeddings (PCA scores: 2638 cells × 50 PCs)
        └── feature.loadings (PCA loadings: 2000 genes × 50 PCs)

Examine PCA Results

# Print the genes with highest loadings for the first 5 PCs
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7

Seurat Object Structure

The Seurat object now contains:

pbmc
├── assays
│   └── RNA (counts, data, scale.data layers)
├── meta.data (cell-level metadata)
├── reductions
│   └── pca
│       ├── cell.embeddings (PCA scores: 2638 cells × 50 PCs)
│       └── feature.loadings (PCA loadings: 2000 genes × 50 PCs)
└── ...

Visualize PCA Results

PC Loadings

Visualize loadings for PC1 and PC2:

VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

Visualize loadings for PC1 only:

VizDimLoadings(pbmc, dims = 1, reduction = "pca")

PCA Scatter Plot

DimPlot(pbmc, reduction = "pca") + NoLegend()

Heatmap of PC1

Visualize the top and bottom 500 cells along PC1 component with a heatmap of scaled expression:

# PC1 component, top and bottom 500 cells along pc1 component, heatmap of scaled expression
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)

Determine Dimensionality

Use an elbow plot to determine the number of PCs to use for downstream analysis:

ElbowPlot(pbmc)

We chose 10 here, but encourage users to consider the following:

  • Dendritic cell and NK aficionados may recognize that genes strongly associated with PCs 12 and 13 define rare immune subsets (i.e. MZB1 is a marker for plasmacytoid DCs). However, these groups are so rare, they are difficult to distinguish from background noise for a dataset of this size without prior knowledge.
  • We encourage users to repeat downstream analyses with a different number of PCs (10, 15, or even 50!). As you will observe, the results often do not differ dramatically.
  • We advise users to err on the higher side when choosing this parameter. For example, performing downstream analyses with only 5 PCs does significantly and adversely affect results.

7. Cluster the Cells

Understanding the Clustering Process

The clustering workflow involves two main steps:

  1. FindNeighbors():
    • Uses first 10 PCs to calculate distances
    • Each cell finds k nearest neighbors → KNN graph
    • Uses Jaccard similarity to reweight edges → SNN graph ($RNA_snn)
  2. FindClusters():
    • Runs Louvain/SLM on the SNN graph
    • Finds communities based on modularity → clusters
    • Resolution controls granularity: higher value = more clusters
    • Results written to meta.data$seurat_clusters (new column) and Idents(pbmc)
# Function: FindNeighbors()
# Input: 
#   - pbmc@reductions$pca@cell.embeddings: matrix (2638 cells × 50 PCs)
#   - dims: numeric vector (1:10, use first 10 PCs)
# Output: 
#   - pbmc@graphs$RNA_nn: sparse matrix (2638 × 2638, KNN graph)
#     - Data type: numeric, values 0 or 1 (binary adjacency)
#   - pbmc@graphs$RNA_snn: sparse matrix (2638 × 2638, SNN graph)
#     - Data type: numeric, values 0 to 1 (weighted edges, Jaccard similarity)

# Function: FindClusters()
# Input: 
#   - pbmc@graphs$RNA_snn: sparse matrix (2638 × 2638)
#   - resolution: numeric (0.5)
# Output: 
#   - pbmc@meta.data$seurat_clusters: factor (2638 values, cluster IDs: 0-8)
#   - pbmc@active.ident: factor (2638 values, same as seurat_clusters)
#   - Data type: categorical (9 clusters numbered 0 to 8)

pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 95927
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8728
## Number of communities: 9
## Elapsed time: 0 seconds

Seurat Object Structure (Step 7)

pbmc
├── assays
│   └── RNA
├── meta.data (added: seurat_clusters)
├── reductions
│   └── pca
├── graphs
│   ├── RNA_nn (KNN graph: 2638 × 2638 sparse matrix)
│   └── RNA_snn (SNN graph: 2638 × 2638 sparse matrix)
└── active.ident (cluster assignments)

Examine Graph Structure

# View available graphs
names(pbmc@graphs)
## [1] "RNA_nn"  "RNA_snn"
# "RNA_nn"  "RNA_snn"

# Sparse matrix of edge weights
pbmc@graphs$RNA_snn
## A Graph object containing 2638 cells
# KNN graph: 2638 * 2638 0 or 1 matrix (1 is in this cell k closest distance)
RNA_nn <- pbmc@graphs$RNA_nn
dim(RNA_nn)               # n_cells x n_cells
## [1] 2638 2638
RNA_nn[1:5, 1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##                  AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
## AAACATACAACCAC-1                1                .                .
## AAACATTGAGCTAC-1                .                1                .
## AAACATTGATCAGC-1                .                .                1
## AAACCGTGCTTCCG-1                .                .                .
## AAACCGTGTATGCG-1                .                .                .
##                  AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
## AAACATACAACCAC-1                .                .
## AAACATTGAGCTAC-1                .                .
## AAACATTGATCAGC-1                .                .
## AAACCGTGCTTCCG-1                1                .
## AAACCGTGTATGCG-1                .                1
# SNN graph: 2638 * 2638 0 to 1 matrix (quantify their similarity or overlap in knn)
RNA_snn <- pbmc@graphs$RNA_snn
dim(RNA_snn)
## [1] 2638 2638
RNA_snn[1:10, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##                                                             
## AAACATACAACCAC-1 1.0000000 . . . . . 0.1111111 .         . .
## AAACATTGAGCTAC-1 .         1 . . . . .         .         . .
## AAACATTGATCAGC-1 .         . 1 . . . .         .         . .
## AAACCGTGCTTCCG-1 .         . . 1 . . .         .         . .
## AAACCGTGTATGCG-1 .         . . . 1 . .         .         . .
## AAACGCACTGGTAC-1 .         . . . . 1 .         .         . .
## AAACGCTGACCAGT-1 0.1111111 . . . . . 1.0000000 0.1111111 . .
## AAACGCTGGTTCTT-1 .         . . . . . 0.1111111 1.0000000 . .
## AAACGCTGTAGCCA-1 .         . . . . . .         .         1 .
## AAACGCTGTTTCTG-1 .         . . . . . .         .         . 1
range(RNA_snn[RNA_snn != 0])
## [1] 0.08108108 1.00000000
# 0.08108108 1.00000000

View Cluster Assignments

# Storage places
# View first 5 cluster assignments
head(Idents(pbmc), 5)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##                2                3                2                1 
## AAACCGTGTATGCG-1 
##                6 
## Levels: 0 1 2 3 4 5 6 7 8
# Clusters are also stored in metadata
head(pbmc$seurat_clusters,5)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##                2                3                2                1 
## AAACCGTGTATGCG-1 
##                6 
## Levels: 0 1 2 3 4 5 6 7 8

8. UMAP Visualization

Run UMAP for 2D visualization of the data.

# Function: RunUMAP()
# Input: 
#   - pbmc@reductions$pca@cell.embeddings: matrix (2638 cells × 50 PCs)
#   - dims: numeric vector (1:10, use first 10 PCs)
# Output: 
#   - pbmc@reductions$umap@cell.embeddings: matrix (2638 cells × 2 dimensions)
#     - Data type: numeric (UMAP coordinates)
#     - Column names: UMAP_1, UMAP_2
#     - Values typically range from -10 to +10
#     - Each row represents a cell's 2D position for visualization

pbmc <- RunUMAP(pbmc, dims = 1:10)

Seurat Object Structure (Step 8)

pbmc
├── assays
│   └── RNA
├── meta.data (seurat_clusters)
├── reductions
│   ├── pca
│   └── umap
│       └── cell.embeddings (UMAP coordinates: 2638 cells × 2 dimensions)
├── graphs (RNA_nn, RNA_snn)
└── active.ident

Visualize Clusters on UMAP

Note that you can set label = TRUE or use the LabelClusters function to help label individual clusters:

DimPlot(pbmc, reduction = "umap", label = TRUE)

9. Differential Expression Analysis

Find Markers for Specific Clusters

Cluster 2 Markers

Find all marker genes upregulated in cluster 2 compared to all other cells:

# Function: FindMarkers()
# Input: 
#   - pbmc@assays$RNA@layers$data: sparse matrix (13714 genes × 2638 cells, log-normalized)
#   - pbmc@active.ident: factor (cluster assignments)
#   - ident.1: numeric or character (cluster 2)
# Output: 
#   - cluster2.markers: data.frame (genes × 5 columns)
#     - p_val: numeric (p-value from differential expression test)
#     - avg_log2FC: numeric (average log2 fold change)
#     - pct.1: numeric (% of cells in cluster 2 expressing the gene)
#     - pct.2: numeric (% of cells in other clusters expressing the gene)
#     - p_val_adj: numeric (adjusted p-value)
#   - Rows sorted by p_val (most significant first)

cluster2.markers <- FindMarkers(pbmc, ident.1 = 2)
head(cluster2.markers, n = 5)
##             p_val avg_log2FC pct.1 pct.2    p_val_adj
## IL32 2.892340e-90  1.3070772 0.947 0.465 3.966555e-86
## LTB  1.060121e-86  1.3312674 0.981 0.643 1.453850e-82
## CD3D 8.794641e-71  1.0597620 0.922 0.432 1.206097e-66
## IL7R 3.516098e-68  1.4377848 0.750 0.326 4.821977e-64
## LDHB 1.642480e-67  0.9911924 0.954 0.614 2.252497e-63

Cluster 5 vs Clusters 0 and 3

Find markers distinguishing cluster 5 from specific other clusters:

cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3))
head(cluster5.markers, n = 5)
##                       p_val avg_log2FC pct.1 pct.2     p_val_adj
## FCGR3A        8.246578e-205   6.794969 0.975 0.040 1.130936e-200
## IFITM3        1.677613e-195   6.192558 0.975 0.049 2.300678e-191
## CFD           2.401156e-193   6.015172 0.938 0.038 3.292945e-189
## CD68          2.900384e-191   5.530330 0.926 0.035 3.977587e-187
## RP11-290F20.3 2.513244e-186   6.297999 0.840 0.017 3.446663e-182

Find Markers for All Clusters

Find marker genes for every cluster compared to all remaining cells (positive markers only):

# Function: FindAllMarkers()
# Input: 
#   - pbmc@assays$RNA@layers$data: sparse matrix (13714 genes × 2638 cells, log-normalized)
#   - pbmc@active.ident: factor (9 clusters: 0-8)
#   - only.pos: logical (TRUE, return only upregulated genes)
# Output: 
#   - pbmc.markers: data.frame (genes × 7 columns)
#     - p_val: numeric (p-value)
#     - avg_log2FC: numeric (average log2 fold change)
#     - pct.1: numeric (% cells in cluster expressing gene)
#     - pct.2: numeric (% cells in other clusters expressing gene)
#     - p_val_adj: numeric (adjusted p-value)
#     - cluster: factor (cluster ID, 0-8)
#     - gene: character (gene name)
#   - Contains markers for all 9 clusters

pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE)

# Filter for genes with strong upregulation (log2FC > 1)
pbmc.markers %>%
  group_by(cluster) %>%
  dplyr::filter(avg_log2FC > 1)
## # A tibble: 7,019 × 7
## # Groups:   cluster [9]
##        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene     
##        <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>    
##  1 3.75e-112       1.21 0.912 0.592 5.14e-108 0       LDHB     
##  2 9.57e- 88       2.40 0.447 0.108 1.31e- 83 0       CCR7     
##  3 1.15e- 76       1.06 0.845 0.406 1.58e- 72 0       CD3D     
##  4 1.12e- 54       1.04 0.731 0.4   1.54e- 50 0       CD3E     
##  5 1.35e- 51       2.14 0.342 0.103 1.86e- 47 0       LEF1     
##  6 1.94e- 47       1.20 0.629 0.359 2.66e- 43 0       NOSIP    
##  7 2.81e- 44       1.53 0.443 0.185 3.85e- 40 0       PIK3IP1  
##  8 6.27e- 43       1.99 0.33  0.112 8.60e- 39 0       PRKCQ-AS1
##  9 1.16e- 40       2.70 0.2   0.04  1.59e- 36 0       FHIT     
## 10 1.34e- 34       1.96 0.268 0.087 1.84e- 30 0       MAL      
## # ℹ 7,009 more rows

ROC Test for Classification Power

Seurat has several tests for differential expression which can be set with the test.use parameter (see Seurat DE vignette for details). For example, the ROC test returns the ‘classification power’ for any individual marker (ranging from 0 - random, to 1 - perfect).

cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, 
                                test.use = "roc", only.pos = TRUE)
head(cluster0.markers)
##       myAUC  avg_diff power avg_log2FC pct.1 pct.2
## RPS12 0.827 0.5059247 0.654  0.7387061 1.000 0.991
## RPS6  0.826 0.4762402 0.652  0.6934523 1.000 0.995
## RPS27 0.824 0.5047203 0.648  0.7372604 0.999 0.992
## RPL32 0.821 0.4294911 0.642  0.6266075 0.999 0.995
## RPS14 0.811 0.4334133 0.622  0.6336957 1.000 0.994
## RPS25 0.803 0.5196163 0.606  0.7689940 0.997 0.975

Visualize Marker Expression

Violin Plots (Classical Visualization)

# Normalized data
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

# You can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

Feature Plots

Visualize multiple marker genes on UMAP:

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", 
                               "FCGR3A", "LYZ", "PPBP", "CD8A"))

Top Markers Heatmap

Extract the top 10 marker genes per cluster and visualize in a heatmap:

# Extract top 10 genes per cluster
# 1. group_by(cluster)
#    → "Split" the table by cluster.
#    Now all subsequent operations happen within each cluster separately.
# 2. filter(avg_log2FC > 1)
#    → Keep only genes that are strongly upregulated in that cluster
#    (log2 fold change > 1 means at least ~2× higher expression).
# 3. slice_head(n = 10)
#    → For each cluster, keep the first 10 rows (after filtering).
#    Since FindAllMarkers usually sorts by p_val or p_val_adj, this means:
#    "Take the top 10 strongest / most significant marker genes per cluster."
# 4. ungroup()
#    → Remove the grouping so top10 is now just a normal tibble.
# 5. -> top10
#    → Save the result as a new object top10.
pbmc.markers %>%
  group_by(cluster) %>%
  dplyr::filter(avg_log2FC > 1) %>%
  slice_head(n = 10) %>%
  ungroup() -> top10
# Draw a heatmap of the top 10 upregulated marker genes for each cluster across all cells.
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

Ridge Plots

Ridge plots show the distribution of expression across clusters:

# Single gene
RidgePlot(pbmc, features = "MS4A1")

# Multiple genes, faceted
RidgePlot(pbmc, features = c("MS4A1", "CD79A", "CD3D"), ncol = 1)

Feature-Feature Correlations

Examine feature correlation among all cells:

# B cell markers
FeatureScatter(pbmc, feature1 = "MS4A1", feature2 = "CD79A")

# T cell markers
FeatureScatter(pbmc, feature1 = "CD3D", feature2 = "CCR7")

Cell-Cell Correlations

Cell correlation among variable features:

CellScatter(
  pbmc,
  cell1 = colnames(pbmc)[1],
  cell2 = colnames(pbmc)[2],
  features = VariableFeatures(pbmc)
)

Dot Plot of Marker Genes

Classical bubble plot of marker genes. Dot plots are a concise way to visualize marker gene expression across clusters. The size of the dot represents the percentage of cells expressing the gene, and the color intensity represents the average expression level.

markers.to.plot <- c("MS4A1", "CD79A",    # B cells
                     "CD3D", "IL7R",      # CD4 T
                     "LTB", "NKG7")       # others

DotPlot(pbmc, features = markers.to.plot) + RotatedAxis()

10. Cell Type Annotation

Based on canonical marker genes, we assign cell type identities to each cluster:

  • Cluster 0: Naive CD4+ T cells
  • Cluster 1: CD14+ Monocytes
  • Cluster 2: Memory CD4+ T cells
  • Cluster 3: B cells
  • Cluster 4: CD8+ T cells
  • Cluster 5: FCGR3A+ Monocytes
  • Cluster 6: NK cells
  • Cluster 7: Dendritic Cells
  • Cluster 8: Platelets
# Function: RenameIdents()
# Input: 
#   - pbmc@active.ident: factor (9 levels: 0-8, numeric cluster IDs)
#   - new.cluster.ids: named character vector (9 cell type labels)
# Output: 
#   - pbmc@active.ident: factor (9 levels, cell type names)
#     - Data type: categorical (character labels)
#     - Levels: "Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", 
#               "FCGR3A+ Mono", "NK", "DC", "Platelet"
#   - Note: Original numeric clusters preserved in pbmc@meta.data$seurat_clusters

new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", 
                     "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)

Seurat Object Structure (Step 10)

pbmc
├── assays
│   └── RNA (counts, data, scale.data layers)
├── meta.data (seurat_clusters, nCount_RNA, nFeature_RNA, percent.mt)
├── reductions
│   ├── pca (50 PCs)
│   └── umap (2D coordinates)
├── graphs (RNA_nn, RNA_snn)
└── active.ident (cell type labels: Naive CD4 T, CD14+ Mono, Memory CD4 T, etc.)

Final Annotated UMAP

DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

Save Results

Save the final Seurat object for future use (R Data Serialization, one object only):

saveRDS(pbmc, file = "../pbmc_tutorial.rds")

Session Info

sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] future_1.67.0      patchwork_1.3.2    Seurat_5.3.1       SeuratObject_5.2.0
## [5] sp_2.2-0           dplyr_1.1.4       
## 
## loaded via a namespace (and not attached):
##   [1] deldir_2.0-4           pbapply_1.7-4          gridExtra_2.3         
##   [4] rlang_1.1.6            magrittr_2.0.4         RcppAnnoy_0.0.22      
##   [7] otel_0.2.0             spatstat.geom_3.6-0    matrixStats_1.5.0     
##  [10] ggridges_0.5.7         compiler_4.4.3         png_0.1-8             
##  [13] vctrs_0.6.5            reshape2_1.4.4         stringr_1.6.0         
##  [16] pkgconfig_2.0.3        fastmap_1.2.0          labeling_0.4.3        
##  [19] utf8_1.2.6             promises_1.5.0         rmarkdown_2.30        
##  [22] purrr_1.2.0            xfun_0.54              cachem_1.1.0          
##  [25] jsonlite_2.0.0         goftest_1.2-3          later_1.4.4           
##  [28] spatstat.utils_3.2-0   irlba_2.3.5.1          parallel_4.4.3        
##  [31] cluster_2.1.8.1        R6_2.6.1               ica_1.0-3             
##  [34] bslib_0.9.0            stringi_1.8.7          RColorBrewer_1.1-3    
##  [37] spatstat.data_3.1-9    limma_3.62.2           reticulate_1.44.0     
##  [40] parallelly_1.45.1      spatstat.univar_3.1-4  lmtest_0.9-40         
##  [43] jquerylib_0.1.4        scattermore_1.2        Rcpp_1.1.0            
##  [46] knitr_1.50             tensor_1.5.1           future.apply_1.20.0   
##  [49] zoo_1.8-14             R.utils_2.13.0         sctransform_0.4.2     
##  [52] httpuv_1.6.16          Matrix_1.7-4           splines_4.4.3         
##  [55] igraph_2.2.1           tidyselect_1.2.1       abind_1.4-8           
##  [58] rstudioapi_0.17.1      yaml_2.3.10            spatstat.random_3.4-2 
##  [61] codetools_0.2-20       miniUI_0.1.2           spatstat.explore_3.5-3
##  [64] listenv_0.10.0         lattice_0.22-7         tibble_3.3.0          
##  [67] plyr_1.8.9             withr_3.0.2            shiny_1.11.1          
##  [70] S7_0.2.0               ROCR_1.0-11            evaluate_1.0.5        
##  [73] Rtsne_0.17             fastDummies_1.7.5      survival_3.8-3        
##  [76] polyclip_1.10-7        fitdistrplus_1.2-4     pillar_1.11.1         
##  [79] KernSmooth_2.23-26     plotly_4.11.0          generics_0.1.4        
##  [82] RcppHNSW_0.6.0         ggplot2_4.0.0          scales_1.4.0          
##  [85] globals_0.18.0         xtable_1.8-4           glue_1.8.0            
##  [88] lazyeval_0.2.2         tools_4.4.3            data.table_1.17.8     
##  [91] RSpectra_0.16-2        RANN_2.6.2             dotCall64_1.2         
##  [94] cowplot_1.2.0          grid_4.4.3             tidyr_1.3.1           
##  [97] nlme_3.1-168           cli_3.6.5              spatstat.sparse_3.1-0 
## [100] spam_2.11-1            viridisLite_0.4.2      uwot_0.2.3            
## [103] gtable_0.3.6           R.methodsS3_1.8.2      sass_0.4.10           
## [106] digest_0.6.37          progressr_0.17.0       ggrepel_0.9.6         
## [109] htmlwidgets_1.6.4      farver_2.1.2           R.oo_1.27.1           
## [112] htmltools_0.5.8.1      lifecycle_1.0.4        httr_1.4.7            
## [115] statmod_1.5.1          mime_0.13              MASS_7.3-65